OPEN ACCESS Freely available online

PLOS ONE



Dimensionality of Social Networks Using Motifs and 
Eigenvalues 

Anthony Bonato1*, David F. Gleich2*, Myunghwan Kim3, Dieter Mitsche4, Paweł Prałat1, Yanhua Tian5,
Stephen J. Young6

1 Department of Mathematics, Ryerson University, Toronto, Ontario, Canada, 2 Computer Science Department, Purdue University, West Lafayette, Indiana, United States of
America, 3 Electrical Engineering Department, Stanford University, Stanford, California, United States of America, 4 Laboratoire J.A. Dieudonné, Université de Nice
Sophia-Antipolis, Nice, France, 5 Department of Mathematics and Statistics, York University, Toronto, Ontario, Canada, 6 Mathematics Department, University of Louisville,
Louisville, Kentucky, United States of America






Abstract 

We consider the dimensionality of social networks, and develop experiments aimed at predicting that dimension. We find
that a social network model with nodes and links sampled from an m-dimensional metric space with power-law distributed
influence regions best fits samples from real-world networks when m scales logarithmically with the number of nodes of the
network. This supports a logarithmic dimension hypothesis, and we provide evidence with two different social networks,
Facebook and LinkedIn. Further, we employ two different methods for confirming the hypothesis: the first uses the
distribution of motif counts, and the second exploits the eigenvalue distribution.



Citation: Bonato A, Gleich DF, Kim M, Mitsche D, Prałat P, et al. (2014) Dimensionality of Social Networks Using Motifs and Eigenvalues. PLoS ONE 9(9): e106052.
doi:10.1371/journal.pone.0106052

Editor: Satoru Hayasaka, Wake Forest School of Medicine, United States of America 
Received May 2, 2014; Accepted July 25, 2014; Published September 4, 2014 

Copyright: © 2014 Bonato et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits 
unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited. 

Data Availability: The authors confirm that all data underlying the findings are fully available without restriction. All of the data used to perform the statistical 
analysis and validate the finding of the authors' paper is available, as well as the code to turn the primary source information into the features. These codes and 
data suffice to reproduce all figures in the paper, all statistical analysis, and fully describe all data translation. The reference tar file is available from
https://www.cs.purdue.edu/homes/dgleich/codes/geop-dim/geop-dim.tar.gz

Funding: NSERC DG grants; NSF CAREER award CCF-1149756; MITACS for hosting the authors' research team at the Advances in Network Analysis and its
Applications Workshop held at the University of British Columbia in July 2012. The funders had no role in study design, data collection and analysis, decision to
publish, or preparation of the manuscript.

Competing interests: The authors have declared that no competing interests exist. 
* Email: abonato@ryerson.ca (AB); dgleich@purdue.edu (DFG) 



Introduction 

Empirical studies of on-line social networks as undirected
graphs suggest these graphs have several intrinsic properties:
highly skewed or even power-law degree distributions [1,2], large
local clustering [3], constant [3] or even shrinking diameter with
network size [4], densification [4], and localized information flow
bottlenecks [5,6]. These are challenging properties to capture in
concise models of social network connections and growth [7-9],
and many models only possess them in certain parameter regimes.
One model that captures these properties asymptotically is the
geometric protean model (GEO-P) [10]. It differs from other
network models [1,4,11,12] because all links in geometric protean
networks arise based on an underlying metric space. This metric
space mirrors a construction in the social sciences called Blau
space [13]. In Blau space, agents in the social network correspond
to points in a metric space, and the relative position of nodes
follows the principle of homophily [14]: nodes with similar
socio-demographics are closer together in the space.

In order to accurately capture the observed properties of social
networks (in particular, constant or shrinking diameters), the
dimension of the underlying metric space in the GEO-P model
must grow logarithmically with the number of nodes. The
logarithmically scaled dimension is a property that occurs
frequently in network models that incorporate geometry, such
as multiplicative attribute graphs [7] and random Apollonian
networks [15]. Because of its prevalence in these models, the
logarithmic relationship between the dimension of the metric
space and the number of nodes has been called the logarithmic
dimension hypothesis [10]. This hypothesis generalizes previous
analysis which shows that individuals in a social network can be
identified with relatively little information. For instance, Sweeney
found that 87% of the U.S. population had reported attributes that
likely made them unique using only zip code, gender and date of
birth, and concluded that few attributes were needed to uniquely
identify a person in the U.S. population [16]. Here, we find
evidence of the log-dimension property in real-world social
networks.

We emphasize that the present paper is the first study that we
are aware of which attempts to quantify the dimensionality of
social networks and Blau space. While we do not claim to prove
conclusively the logarithmic dimension hypothesis for such
networks, our experiments, such as those of [16], suggest a much
smaller dimension in contrast to the overall size of the networks.
Interestingly, speculation on the low dimensionality of social
networks arose independently from theoretical analysis of
mathematical models of social networks in [7,10,15].

Our findings provide evidence for dimensional properties
underlying social networks that have a number of potential
applications in future studies. First, the dimensional properties
could be used for further classification and characterization of
different types of networks. Second, many NP-hard optimization
problems related to graph properties and community detection are
polynomial time solvable in a low dimensional metric space, and
thus, our findings suggest new techniques to explore for
understanding why we may expect to solve these problems in
social networks. Finally, if techniques to find these dimensions
emerge, we should be able to create powerful new methods to
harness the insight they offer into the network structure.

MGEO-P 

The particular network model we study is a simple variation on
the GEO-P model that we name the memoryless geometric
protean model (MGEO-P), since it enables us to approximate a
GEO-P network without using a costly sampling procedure. Both
the GEO-P and the MGEO-P models depend on five parameters
described in Table 1.

The nodes and edges of the network arise from the following
process. Initially the network is empty. At each of $n$ steps, a new
node $v$ arrives and is assigned both a random position $q_v$ in
$\mathbb{R}^m$ within the unit hypercube $[0,1]^m$ and a random rank
$r_v$ from those unused ranks remaining in the set 1 to $n$. The
influence radius of any node is computed based on the formula:

\[ I(r) = \frac{1}{2} \left( r^{-\alpha} n^{-\beta} \right)^{1/m}. \]

With probability $p$, the node $v$ forms an undirected connection
to any preexisting node $u$ where $D(v,u) \le I(r_v)$, where the
distances are computed with respect to the following metric:

\[ D(v,u) = \min \left\{ \| q_v - q_u - z \|_\infty : z \in \{-1,0,1\}^m \right\}, \]

and where $\|\cdot\|_\infty$ is the infinity-norm. We note that this implies that
the geometric space is symmetric in any point as the metric
"wraps" around like on a torus. The volume of space influenced
by the node is $r_v^{-\alpha} n^{-\beta}$. Then the next node arrives and repeats the
process until all $n$ nodes have been placed. In the MGEO-P
model, the process ends here, whereas in the GEO-P model, the
network then removes the least-recently added node, and inserts a
new node following the same procedure. This iterative replacement
process continues until it reaches a random point.
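
The process above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' generator: the function names (`torus_distance`, `influence_radius`, `mgeo_p`) are hypothetical, ranks are assigned by shuffling 1 to n up front (which matches drawing an unused rank at each step), and the connection rule follows the reading $D(v,u) \le I(r_v)$ given in the text.

```python
import random


def torus_distance(a, b):
    """Infinity-norm distance on the unit torus [0,1)^m.

    Minimising over shifts z in {-1,0,1}^m decouples per coordinate,
    so each coordinate contributes min(|d|, 1 - |d|).
    """
    return max(min(abs(x - y), 1.0 - abs(x - y)) for x, y in zip(a, b))


def influence_radius(rank, n, m, alpha, beta):
    """I(r) = (1/2) * (r^-alpha * n^-beta)^(1/m)."""
    return 0.5 * (rank ** -alpha * n ** -beta) ** (1.0 / m)


def mgeo_p(n, m, alpha, beta, p, seed=None):
    """Sample positions, ranks, and edges of one MGEO-P style network."""
    rng = random.Random(seed)
    ranks = list(range(1, n + 1))
    rng.shuffle(ranks)  # node v (in arrival order) receives ranks[v]
    positions, edges = [], []
    for v in range(n):
        q_v = [rng.random() for _ in range(m)]
        radius = influence_radius(ranks[v], n, m, alpha, beta)
        for u in range(v):  # only preexisting nodes are candidates
            if torus_distance(q_v, positions[u]) <= radius and rng.random() < p:
                edges.append((u, v))
        positions.append(q_v)
    return positions, ranks, edges
```

For instance, `mgeo_p(250, 2, 0.9, 0.04, 1.0, seed=0)` uses the parameter values quoted in the Figure 1 caption. The double loop costs O(n^2) distance evaluations, which is adequate for a sketch but not for the largest networks studied here.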

Figure 1 illustrates two features of the model. First, after a few
steps, only a few nodes exist and even a large influence region will
only produce a few links. Second, when the number of steps
approaches n, a large influence region will produce many links.
The idea behind the model is a simple abstraction of the growth of
an on-line social network. When the network is first growing (few
steps), even influential members will only know a few other
members who have also joined. But after the network has been
around for a while (many steps), influential members will begin
with many friends.

Table 1. The parameters of the MGEO-P model.

Parameter                   Description
$n$                         the total number of nodes
$m$                         the dimension of the metric space
$0 < \alpha < 1$            the attachment strength parameter
$0 < \beta < 1 - \alpha$    the density parameter
$0 < p \le 1$               the connection probability

We formally prove that the MGEO-P model has the following
properties. Let $\alpha \in (0,1)$, $\beta \in (0,1-\alpha)$, $p \in (0,1]$ and let $m$ be a positive
integer. The following statements hold with probability tending to
1 as $n$ tends to $\infty$. See the MGEO-P section of File S1 for the
proofs. We actually show these results hold with extremely high
probability, which is a stronger notion that implies probability
tending to 1.

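1. Let $v$ be a node of MGEO-P$(n,m,\alpha,\beta,p)$ with rank $R$ that
arrived at step $t$. Then, up to lower-order terms,

\[ \deg(v) = \frac{t-1}{n-1} \, \frac{p}{1-\alpha} \, n^{1-\alpha-\beta} + (n-t) \, p \, R^{-\alpha} n^{-\beta}. \]

This result implies that the degree distribution follows a
power law with exponent $\eta = 1 + 1/\alpha$.

2. The average degree of a node of MGEO-P$(n,m,\alpha,\beta,p)$ is, up to
lower-order terms,

\[ \frac{p}{1-\alpha} \, n^{1-\alpha-\beta}. \]

3. The diameter of MGEO-P$(n,m,\alpha,\beta,p)$ is $n^{\Theta(1/m)}$.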

This last property suggests that, ignoring constants, for a
network with $n$ nodes and diameter $D$, the expected dimension
based on the MGEO-P model is

\[ m \approx \frac{\log n}{\log D}. \]

Thus, like some network models that incorporate geometry
[7,15], in the MGEO-P model, the dimension $m$ must scale
logarithmically in order for the diameter to remain constant as $n$
increases.
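
As a small worked illustration of this estimate, the snippet below evaluates log(n)/log(D) for a given node count and diameter. The function name `predicted_dimension` and the example values are hypothetical and are not taken from the datasets.

```python
import math


def predicted_dimension(n, diameter):
    """Model prediction m ~ log(n) / log(D), ignoring constants."""
    if n < 2 or diameter <= 1:
        raise ValueError("need n >= 2 and diameter > 1")
    return math.log(n) / math.log(diameter)


# Hypothetical example: 10,000 nodes with diameter 10 gives a
# predicted dimension of about 4; halving the diameter raises it.
m_wide = predicted_dimension(10_000, 10)
m_narrow = predicted_dimension(10_000, 5)
```

Because the ratio of logarithms does not depend on the base, the same value results from base-10 or natural logarithms.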

Experimental Design and Graph Summaries 

Both graph motifs and spectral densities are numeric summaries
of a graph that abstract the details of a network into a small set of
values that are independent of the particular nodes of a network.
These summaries have the property that isomorphic graphs have
the same values, and we will use these summaries to determine the
dimension of the metric space that best matches Facebook and
LinkedIn networks as illustrated in Figure 2. Graph motifs,
graphlets, or graph moments are the frequency or abundance of
specific small subgraphs in a large network. We study undirected,
connected subgraphs up to four nodes as our graph motifs (with
the exception of the number of edges, or two-node motifs, as the
networks are created to preserve this count). This is a set of 8
graphs shown at the bottom of Figure 2 along with the single
two-node graph of an edge. The spectral density of a graph is the
statistical distribution of eigenvalues of the normalized Laplacian
matrix as indicated in the upper right of that figure. These
eigenvalues indicate and summarize many network properties
including the behavior of a uniform random walk, the number of
connected components, an approximate connectivity measure,
and many other features [17,18]. Thus, the spectral density of the
normalized Laplacian is a particularly helpful characterization that
captures many such separate network properties.

Figure 1. An example describing the MGEO-P process on a graph with 250 nodes in the unit square with torus metric, where $\alpha = 0.9$
and $\beta = 0.04$ and $p = 1$. Each figure shows the graph "replicated" in grey on all sides in order to illustrate the torus metric. Links are drawn to the
closest replicated neighbor. The blue square indicates the region $[0,1]^2$. Top row (left to right): the MGEO-P process begins with relatively few nodes,
and thus, nodes must have large influence radii (red squares) to link anywhere. As more nodes arrive, large radii result in many connections, modeling
influential users, and small radii result in a few connections, modeling standard users. Bottom row: the final constructed graph.
doi:10.1371/journal.pone.0106052.g001

We study dimensional scaling in social networks by comparing
samples of the MGEO-P networks of varying dimensions with
samples of social network data from Facebook and LinkedIn. We
pay particular attention to the relationship between the number of
nodes n of the network and the dimension m of the best fit
MGEO-P network. In order to determine what underlying
dimension for MGEO-P best fits a given graph, we employ two
distinct methods. For one experiment, we use features known as
graph motifs, graphlets, or graph moments in concert with a
support vector machine (SVM) classifier. This approach has been
used successfully to determine the best generative mechanism of a
network [19] and to select parameters of complicated network
models to fit real-world data [9,20]. In a second experiment, we
use spectral densities of the normalized Laplacian matrix of a
graph and a Kullback-Leibler divergence (KL divergence)
similarity measurement, which has been used to match protein
networks between species [21,22]. We find evidence of the
logarithmic dimension hypothesis in both cases.
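
The spectral comparison can be sketched as follows, assuming each network has already been summarized as a histogram over the same bins. The function names, the direction of the divergence (observed network relative to model sample), and the small smoothing constant `eps` that guards against empty bins are all assumptions of this sketch and are not specified in the text.

```python
import math


def kl_divergence(p_hist, q_hist, eps=1e-10):
    """KL(P || Q) for two histograms defined over identical bins."""
    if len(p_hist) != len(q_hist):
        raise ValueError("histograms must have the same number of bins")
    # Smooth and renormalise so that empty bins do not produce log(0).
    p_total = sum(p_hist) + eps * len(p_hist)
    q_total = sum(q_hist) + eps * len(q_hist)
    divergence = 0.0
    for p_count, q_count in zip(p_hist, q_hist):
        p = (p_count + eps) / p_total
        q = (q_count + eps) / q_total
        divergence += p * math.log(p / q)
    return divergence


def best_fit_dimension(target_hist, sample_hists):
    """Return the dimension whose sample histogram is closest in KL.

    sample_hists maps a candidate dimension m to the histogram of an
    MGEO-P sample generated with that dimension.
    """
    return min(sample_hists,
               key=lambda m: kl_divergence(target_hist, sample_hists[m]))
```

With a toy input such as `best_fit_dimension([1, 2, 1], {2: [4, 0, 0], 3: [1, 2, 1], 4: [0, 0, 4]})`, the dimension whose histogram matches the target exactly is selected.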

The data 

Facebook distributed 100 samples of social networks from
universities within the United States measured as of September
2005 [23], which range in size from 700 nodes to 42,000 nodes.
We call these networks the Facebook samples. The LinkedIn
samples were created from the LinkedIn connection network
together with the creation time of each connection from May 2003
to October 2006. To perform our experiments on networks of
different size, we built 71 snapshots of the LinkedIn network at
various timestamps. We then extracted a dense subset of the
graph at each time point that is representative of active users;
we used the 5-core of the network for this purpose [24]. The
k-core of a network is a maximum size subset of vertices such that all
vertices have degree at least k within the subset. See Figure 3 and the
full statistics tables of File S1 for additional properties of these
networks. In both networks, the number of edges per node grows at
essentially the same rate.
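
The k-core extraction can be illustrated with the standard peeling procedure: repeatedly delete any vertex whose remaining degree is below k. This is a generic sketch, not the code used to build the LinkedIn samples; the function name `k_core` and the dictionary-of-sets graph representation are assumptions.

```python
def k_core(adjacency, k):
    """Return the vertex set of the k-core of an undirected graph.

    adjacency maps each vertex to the set of its neighbours.
    """
    degree = {v: len(neighbours) for v, neighbours in adjacency.items()}
    removed = set()
    stack = [v for v, d in degree.items() if d < k]
    while stack:
        v = stack.pop()
        if v in removed:
            continue
        removed.add(v)
        # Deleting v lowers the degree of each surviving neighbour.
        for u in adjacency[v]:
            if u not in removed:
                degree[u] -= 1
                if degree[u] < k:
                    stack.append(u)
    return set(adjacency) - removed
```

On a triangle with one pendant vertex attached, the 2-core is the triangle alone and the 3-core is empty.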

Results 

The results of our dimensional fitting for graphlets are shown in
Figure 4 and the results of the fitting using spectral densities are in
Figure 5. For both datasets and both types of statistics, the best-fit
dimension scales logarithmically with the number of nodes and
closely tracks a simple model prediction based on the diameter $D$
of the network (the model curve plots $m = \log(n)/\log(D)$). These
experiments corroborate the logarithmic dimension hypothesis,
although the precise fits differ as shown in Table 2.
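
A fit of the form m = a log(n)/log(10) + b, as reported in Table 2, is an ordinary least-squares line in log10(n). The sketch below shows one way to compute such a fit; the function name `fit_log_dimension` and the example data are hypothetical, and the confidence intervals reported in Table 2 are not reproduced here.

```python
import math


def fit_log_dimension(sizes, dimensions):
    """Least-squares fit of m = a * log10(n) + b; returns (a, b)."""
    xs = [math.log10(n) for n in sizes]
    count = len(xs)
    mean_x = sum(xs) / count
    mean_y = sum(dimensions) / count
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("need at least two distinct network sizes")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, dimensions))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data lying exactly on m = 2 * log10(n) + 1.
a, b = fit_log_dimension([100, 1_000, 10_000], [5, 7, 9])
```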

The most important feature of these results is that both
methodologies show similar scaling in how the dimensionality
scales with network size. There are minor differences between the
precise predicted dimensions (for instance, the spectral density
approach predicts slightly higher dimensions for Facebook than
does the graphlet approach), but the results agree to a reasonable
degree with the dimension predicted by the model:
$\log(n)/\log(D)$. Also, the confidence bounds are small around
the chosen dimension.

Figure 2. At left and center, we have the steps involved in fitting via graphlets; at right and center, we have the steps involved in
fitting via spectral histogram. Throughout, red lines denote the flow of features for the MGEO-P networks whereas blue lines denote flow of
features for the original networks. At the bottom, we show an enlarged representation of the 8 graphlets we use.
doi:10.1371/journal.pone.0106052.g002




Figure 3. The scale of the network data involved in our study
varies over three orders of magnitude. We see similar scaling for
both types of networks, but with slightly different offsets. For Facebook,
$\log_{10}(\text{edges}) = 1.06 \log_{10}(\text{nodes}) + 1.35$ with $R^2 = 0.945$; for LinkedIn,
$\log_{10}(\text{edges}) = 1.07 \log_{10}(\text{nodes}) + 0.56$ with $R^2 = 0.999$. The regularity
in the LinkedIn sizes is due to our construction of those networks.
doi:10.1371/journal.pone.0106052.g003



Sensitivity and robustness 

We investigate the sensitivity of the graphlet results in two
settings. If we reduce the training set size of the SVM classifier by
using a random subset of 20% of the input training data and then
rerun the training and classification procedure 50 times, then we
find a distribution over dimensions that we report as a box-plot,
shown in Figure 4. In File S1, we further study perturbation results
that argue against these results occurring due to chance. In
particular, we find that these dimensions are robust to moderate
changes to the network structure (Figure S2 in File S1) and we find
that our methodology does not predict useful dimensions of
Erdős-Rényi random graphs or random graphs with the same degree
distribution (Figure 1 in File S1). We do not report a precise
p-value as there are no widely accepted null models for network
data. For the spectral densities, we study sensitivity by looking for
matches whose divergence is within 105% of the true minimum divergence.
This defines a dimension interval around each match that is small
for all of our examples.
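
The 105% rule for the spectral fits can be sketched as below, assuming the KL divergence to the observed network has already been computed for one MGEO-P sample per candidate dimension. The function name `dimension_interval` and the example values are hypothetical; the sketch reports the smallest and largest dimension within tolerance and does not check that the dimensions in between also qualify.

```python
def dimension_interval(divergences, tolerance=1.05):
    """Return (best, low, high) for a map of dimension -> divergence.

    best is the dimension with minimum divergence; low and high bound
    the dimensions whose divergence is within tolerance * minimum.
    """
    best = min(divergences, key=divergences.get)
    cutoff = tolerance * divergences[best]
    close = [m for m, d in divergences.items() if d <= cutoff]
    return best, min(close), max(close)


# Hypothetical divergences for candidate dimensions 2 through 5.
example = {2: 0.50, 3: 0.205, 4: 0.20, 5: 0.40}
best, low, high = dimension_interval(example)
```

In this toy input, dimension 4 attains the minimum and dimension 3 falls inside the 5% band, so the interval spans 3 to 4.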

Discussion 

There is a growing body of evidence that argues for some type
of geometric structure in social and information networks. An
important study in this direction views networks as samples of
geometric graphs within a hyperbolic space [25-27]. Recent work
has further shown that hyperbolic embeddings reproduce shortest
path metrics in real-world networks [28]. In both MGEO-P and
hyperbolic random geometric networks, highly skewed or
power-law degree distributions are imposed, either directly as in
MGEO-P, or implicitly as in the hyperbolic space scaling. Our results
further support hidden metric structures in networks by
empirically confirming a prediction about the dimension of the metric
space made by one particular model. The importance of this
finding is that it provides new insight into how the metric space
must behave as the network grows. Previous studies assume a fixed
dimension metric structure and our results indicate that a variable
dimension may be more appropriate. In practice, estimating the
dimensions of these networks could be useful for anomaly
detection in the network and characterizing new types of network
data.

Figure 4. Facebook dimension at left, LinkedIn dimension at right. Each red dot (SVM) is the predicted dimension computed via graphlet
features and a support vector machine classifier. For the Facebook data, we find that $m = 2.06 \log(n)/\log(10) - 3.00$. For the LinkedIn data, we find
that $m = 0.7333 \log(n)/\log(10) + 1$. These fits are plotted as the red lines. Our theoretical model predicts a dimension of $\log(n)/\log(D)$, which
we plot as the dashed line. In each figure, we show the variance in the fitted dimension as a box-plot. We estimate the variance by using only
20% of the original training data and repeating over 50 trials. There are only a few outliers for small dimensions.
doi:10.1371/journal.pone.0106052.g004

Note that these results do not conclusively argue that MGEO-P
is a perfectly accurate model for social networks; there are
meaningful differences between the spectral histograms from
MGEO-P and real social networks (see Figure 6). There are also
similar differences in the graphlet counts. Our results support a
different hypothesis. The closest MGEO-P network to a given
social network has a metric space whose dimension scales
logarithmically with the number of nodes. In File S1 (Sensitivity
studies section), we have determined that this property is not due
to either the edge density or the degree distribution; thus, our
findings appear to reflect a new intrinsic property of social
networks.

This finding suggests a number of opportunities for designing
social network models with metric spaces that evolve in time. We
believe that such models offer the opportunity to identify new
properties of social networks based on emergent properties of the
models. One question to address is how the metric space and
connection radius change, if at all, as the network grows.
Answering this question would provide insight into the value of
additional users of a network. Additionally, our results suggest that
many network models that assume a fixed dimension should be
reevaluated.

Materials and Methods 

Powerlaw fitting 

To determine the power-law exponent $\eta$, we use the
Clauset-Shalizi-Newman power-law exponent estimator [29] as
implemented by Tamás Nepusz [30].

Diameters 

The MGEO-P model of a network predicts that the dimension
$m$ should approximate $\log(n)/\log(D)$, where $D$ is the diameter.
However, as $D$ is sensitive to outliers, we use the 99% effective
diameter computed via an asymptotically accurate approximation
scheme [31] as implemented in the SNAP library on 2011-12-31.
The effective diameter of all Facebook networks ranges between
3.5 and 4.6, with a mean of 4.1. For the LinkedIn data, the
effective diameter ranges between 4.3 and 5.9, with a mean of 5.4.
In both networks, larger graphs have bigger effective diameters,
although the differences are slight; the full data is available in
the Full statistics tables of File S1.

Figure 5. Facebook data at left, LinkedIn data at right. Each blue point (Eigen) is the dimension of the MGEO-P sample with the minimum
KL-divergence between the graph and the MGEO-P sample. We also show any other dimensions within 5% of this divergence value. The
dimensions shift modestly higher for Facebook and remain almost unchanged for LinkedIn. Both are still closely correlated with the theoretical
prediction of the model, $\log(n)/\log(D)$ (dashed line). The linear fits to the predicted dimensions are plotted as the red lines.
doi:10.1371/journal.pone.0106052.g005

Table 2. Dimension scaling for Facebook and LinkedIn.

Dimension fit: $a \log(n)/\log(10) + b$

Data      Method            a     b      95% confidence (a)  95% confidence (b)
Facebook  Graphlet          2.06  -3.00  (1.851, 2.264)      (-3.821, -2.182)
Facebook  Spectral density  1.21  1.65   (0.9782, 1.446)     (0.7272, 2.578)
LinkedIn  Graphlet          0.98  1.01   (0.786, 1.178)      (0.1591, 1.87)
LinkedIn  Spectral density  0.77  1.1    (0.56, 0.99)        (0.23, 1.95)

The specific dimensional scaling lines fit to the data in Figures 4 and 5 illustrate that the fitted dimension grows logarithmically in the number of nodes.
doi:10.1371/journal.pone.0106052.t002

Graphlets 

To compute graphlets, we employ the rand-esu sampling 
algorithm [32] as implemented in the igraph library [33]. This 
algorithm approximates the count of each subgraph via a 
stochastic search, which then depends on the probability of 
continuing to search. Thus, if the probability is near 1 then the 
scores are nearly exact, but very expensive to compute, and small 
probabilities truncate the search early to produce fast estimates.
The value we use is 10/m. We use log-transformed output from this 
procedure in order to capture the dynamic range of the resulting 
values. 
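
To make the feature concrete, the sketch below counts the two connected three-node motifs (open paths and triangles) by exhaustive enumeration and then log-transforms the counts. It is deliberately not the rand-esu sampler: it is exact, cubic in the number of nodes, and covers only three-node motifs. The function names and the use of log(1 + count) to handle zero counts are assumptions of this sketch.

```python
import math
from itertools import combinations


def count_three_node_motifs(adjacency):
    """Return (paths, triangles) as induced connected 3-node subgraphs.

    adjacency maps each vertex to the set of its neighbours.
    """
    paths = triangles = 0
    for a, b, c in combinations(sorted(adjacency), 3):
        links = ((b in adjacency[a]) + (c in adjacency[a])
                 + (c in adjacency[b]))
        if links == 3:
            triangles += 1
        elif links == 2:  # exactly two edges on three nodes form a path
            paths += 1
    return paths, triangles


def log_features(counts):
    """Log-transform motif counts; log(1 + c) keeps zero counts finite."""
    return [math.log(1 + c) for c in counts]
```

For a triangle with one pendant vertex, `count_three_node_motifs` returns two paths and one triangle.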

Spectral densities 

We approximate the spectral density via a 201-bin histogram of
the eigenvalues of the normalized Laplacian, which all fall between 
0 and 2. (The choice of 201 was based on prior experiences with 
the spectral histograms of networks.) To compute eigenvalues of a 
network, we employ the recently developed ScaLAPACK routine 
using the MRRR algorithm [34-36]. 
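
Given the eigenvalues, the histogram step is straightforward. The sketch below assumes the eigenvalues of the normalized Laplacian have already been computed by some eigensolver and bins them over [0, 2]; the function name `spectral_histogram`, the equal-width bins, and the small tolerance for round-off outside [0, 2] are assumptions.

```python
def spectral_histogram(eigenvalues, bins=201):
    """Normalised histogram of eigenvalues over [0, 2] with equal bins."""
    if not eigenvalues:
        raise ValueError("need at least one eigenvalue")
    width = 2.0 / bins
    counts = [0] * bins
    for value in eigenvalues:
        if value < -1e-9 or value > 2.0 + 1e-9:
            raise ValueError("normalized Laplacian eigenvalues lie in [0, 2]")
        # Clamp round-off and place the endpoint 2.0 in the last bin.
        index = min(int(max(value, 0.0) / width), bins - 1)
        counts[index] += 1
    total = len(eigenvalues)
    return [c / total for c in counts]
```

As a check, the normalized Laplacian of a triangle has eigenvalues 0, 1.5, and 1.5, so one third of the mass falls in the first bin and two thirds in the bin containing 1.5.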

SVM 

We used a multi-class support-vector machine (SVM) based
classification tool from Weka [37] to predict the relationship
between the graphlets and the dimension. We considered
alternatives, such as alternating decision trees and logistic
regression; however, we settled on the SVM approach as it has
the most flexible classification boundary to fit the highly nonlinear
relationships between graphlet counts and dimensions.

Setting MGEO-P Parameters 

Consider a graph $G = (V,E)$ that we wish to compare to an
MGEO-P sample. Apart from the connection probability $p$, which
is treated separately below, the MGEO-P model depends on four
parameters: $n$, $m$, $\alpha$, and $\beta$. The choice of $n$ is straightforward as
we use the number of nodes of the original graph. Both $\alpha$ and $\beta$
can be chosen independently of the dimension $m$. Specifically,
$\alpha$ and $\beta$ determine the average degree of the network and the
exponent of the power law in the degree distribution, up to
lower-order terms, as shown by property 1 and property 2. By computing
just these two simple statistics of a network (the exponent of the
power law and the average degree) we can invert these
relationships and choose these parameters. Let $\eta$ be the power-law
exponent and $\rho$ be the average degree. Then:

\[ \alpha + \beta = 1 - \frac{\log(\rho)}{\log(n)} \quad \text{and} \quad \alpha = \frac{1}{\eta - 1}. \]
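
The inversion above amounts to two lines of arithmetic. The sketch below computes the attachment strength and density parameters from the node count, the average degree, and the fitted power-law exponent; the function name `mgeo_p_exponents` is hypothetical, and the range check against the constraints of Table 1 is an addition of this sketch.

```python
import math


def mgeo_p_exponents(n, avg_degree, powerlaw_exponent):
    """Return (alpha, beta) from alpha = 1/(eta - 1) and
    alpha + beta = 1 - log(rho)/log(n)."""
    alpha = 1.0 / (powerlaw_exponent - 1.0)
    beta = 1.0 - math.log(avg_degree) / math.log(n) - alpha
    if not (0.0 < alpha < 1.0 and 0.0 < beta < 1.0 - alpha):
        raise ValueError("parameters fall outside 0<alpha<1, 0<beta<1-alpha")
    return alpha, beta


# Hypothetical network: 10,000 nodes, average degree 10, exponent 3.
alpha, beta = mgeo_p_exponents(10_000, 10.0, 3.0)
```

For this hypothetical input, log(10)/log(10000) is 1/4, so alpha is 0.5 and beta is 0.25.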



In order to derive this simple expression, we make the simplifying
assumption that $p$ does not go to zero too quickly, for example
$p = n^{-o(1)}$, in which case $\log(\rho) = (1 - \alpha - \beta)\log(n) + o(1)$ follows
from the expression for the average degree of an MGEO-P network.
We use the following treatment of the probability $p$ in order to
maximize the clustering coefficient of the network. We first generate
an MGEO-P network with $p = 1$. If the original network had
$E = n\rho/2$ edges, we then randomly delete
edges until the output has exactly the same number of edges as the
input network. This step can be interpreted as using the value of $p$
necessary to get the same edge count as the original graph. In the
case where there are insufficient edges, we leave the output from the
MGEO-P generator untouched. This process effectively chooses $p$
as large as possible, which gives us the largest local clustering.

Figure 6. For three of the Facebook networks, we show the eigenvalue histogram in red, the eigenvalue histogram from the best fit
MGEO-P network in blue, and the eigenvalue histograms for samples from the other dimensions in grey. The MGEO-P model correctly
captures the peak of the distribution around 1, but fails to completely capture the tail between 1 and 2. Thus, we see meaningful differences between
these profiles and hence do not suggest that MGEO-P captures all of the properties of real-world social networks.
doi:10.1371/journal.pone.0106052.g006
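
The edge-thinning step described under Setting MGEO-P Parameters can be sketched as follows. Keeping a uniformly random subset of the generated edges is equivalent to deleting edges uniformly at random until the target count is reached; the function name `thin_edges` and the edge-list representation are assumptions.

```python
import random


def thin_edges(edges, target_count, seed=None):
    """Randomly delete edges until target_count remain.

    If the generated network already has no more edges than the
    target, it is returned unchanged.
    """
    edges = list(edges)
    if len(edges) <= target_count:
        return edges
    rng = random.Random(seed)
    # A uniform random subset is what repeated random deletion yields.
    return rng.sample(edges, target_count)
```

Here `target_count` would be the number of edges of the original network, and `edges` the output of a generator run with p = 1.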